Abstract

Graph clustering has become an important problem due to
emerging applications involving the web, social networks
and bioinformatics. Recently, many such applications generate
data in the form of streams. Clustering massive, dynamic
graph streams is significantly challenging because of the
complex structures of graphs and the computational difficulties
of continuous data. Meanwhile, a large volume of side
information of various types is associated with graphs.
Examples include the properties of users in social network
activities, the meta attributes associated with web click graph
streams and the location information in mobile communication
networks. Such attributes contain extremely useful information
and have the potential to improve the clustering process, but
they are neglected by most recent graph stream mining
techniques. In this paper, we define a unified distance measure
on both link structures and side attributes for clustering. In
addition, we propose a novel optimization framework, DMO,
which dynamically optimizes the distance metric and adapts it
to newly received stream data. We further introduce carefully
designed statistics, SGS(C), which consume constant storage
space as the stream progresses. We demonstrate that the
maintained statistics are sufficient for both the clustering
process and the distance optimization, and that they scale to
massive graphs with side attributes. We present experimental
results showing the advantages of the approach over the
baselines in graph stream clustering with both links and side
information.

1 Introduction 

Recently, there has been an increasing need for mining dynamic
graphs, driven by rapidly growing social networks, Internet
applications and communication networks [1][2][3][4][5]. A
graph stream is a sequence of individual graph objects that
arrive continuously over time and represent various activities
among nodes in the networks. Such activities can be discussion
threads in social networks, user click graphs in web browsing
sessions and authorship graphs in a dynamically updated
scientific repository. The nodes of each graph object are
typically drawn from a massive domain, such as users in social
networks, IP addresses in Internet applications and terminals
in communication networks. Although each graph object is of
modest size, the total number of distinct nodes and edges in
the aggregated data from the stream can be extremely large.

Many existing approaches to graph stream mining have
been devised to solve different tasks, including clustering
[1], classification [3] and outlier detection [4]. All of these
approaches are primarily designed for mining only the link
structures among graph objects. However, in many real-world
applications, there are many side attributes associated with
graphs that can be highly useful to the mining tasks. Some
examples of such attributes are listed as follows:

• In social networks, many social activities are generated
daily in the form of streams, which can be naturally
represented as graphs. In addition to the graph
representation, there is a tremendous amount of side
information associated with social activities, e.g. user
profiles, behaviors, activity types and geographical
information. These attributes can be quite informative for
analyzing the social graphs. We illustrate an example of
such a user interaction graph stream in Figure 1.

• Web click events are graph object streams generated 
by users. Each graph object represents a series of web 
clicks by a specific user within a time frame. Besides 
the click graph object, the meta data of webpages, 
users' IP addresses and time spent on browsing can all
provide insight into the subtle correlations among click
graph objects.

• In a large scientific repository (e.g. DBLP), each single
article can be modeled as an authorship graph object
[1][4]. In Figure 2, we illustrate an example of an
authorship graph (paper) which consists of three authors
(nodes) and a list of side information. For each article,
the side attributes, including paper keywords, published
venues and years, may be used to enhance the mining
quality since they indicate many meaningful relationships
among authorship graphs.

Figure 1: An Example of a (Directed) Social Activity Graph
Stream with Side Information

Figure 2: An Example of an (Undirected) Authorship Graph
with Side Information

Thus, it is highly desirable for the mining process to
incorporate both links and side information to further improve
its effectiveness. In this paper, we propose a framework to
cluster graph streams with side information. The challenges
of this problem are two-fold:

(1) Both graphs and side information are drawn from
massive domains, which cannot be explicitly held in
memory. For example, let the number of users in a
social network be $N$. The potential number of distinct
interactions (edges) can be as large as the order of $N^2$.
Many kinds of side information, such as IP addresses, tags,
tokenized text and geographical information, can also be
extremely large. Even the summaries of the incoming data
grow rapidly with the stream and eventually can no longer
be explicitly stored. In addition, the problem becomes
particularly challenging in the stream scenario due to the high
rate of incoming data. Storing the data on hard disks for
offline processing cannot efficiently handle the high volume
of streams.

(2) Different types of side information give different
indications of the nature of the clustering, because many side
attributes are quite noisy and insignificant. In other words,
each type of side information contributes to the underlying
clustering to a different degree. For example, when
clustering individual graphs such as the one shown in Figure 2
from a large scientific repository, the aim is to group papers
from the same research area into the same cluster. Thus, the
links representing co-authorship as well as the attributes of
keywords and venues can be quite indicative of which cluster
an individual graph object $G_i$ should be grouped into. However,
attributes such as publication years might not be that useful
for clustering individual papers, since many research
papers in different areas appear in various specialized
conferences every year. Thus, considering both the linkage and
side information, it is non-trivial to qualitatively measure the
importance of each side information type as well as links.

In this paper, we first define a unified distance measure 
E-S Distance which combines the distances with regard to 
linkage and side information. Then we propose a novel 
optimization framework DMO which dynamically learns and 
tracks the importance (weights) of links and side attributes. 
The optimization over links and side attributes is re-run
periodically. It adjusts the weights so that the graphs within
a cluster are as coherent as possible, while the graphs from
different clusters are as distinct as possible. Since
efficiency is critical to stream algorithms and the data size
is massive, it is not realistic to explicitly store all
received data in memory. We introduce a sketch-based
compression framework SGS(C), which can
store the statistics of heterogeneous data including edges
and side attributes. More importantly, we demonstrate 
that DMO can be efficiently and dynamically solved in the 
sketch representation given by SGS(C). We show that 
the proposed approach consumes constant memory with 
the growing incoming data and can be used to estimate 
all measures in the clustering algorithm as well as the 
optimization framework DMO. 

The rest of this paper is organized as follows. In Section 2,
we discuss related work on graph stream clustering. In
Section 3, we define a unified distance metric, the E-S
Distance, on graph objects with side information, and then
present a novel optimization framework, DMO, to dynamically
refine the distance measures as the stream progresses. In
Section 4, we propose the statistics SGS(C) and show how to
use them to estimate the measures required by the clustering
algorithm. We report the experimental results in Section 5 and
present the conclusion in Section 6.

2 Related Work 

In the literature, a number of techniques have been proposed
to mine graph and network data [6][7][8][9]. Traditional
graph clustering methods have been extensively studied in
the node clustering setting of a single static graph, including
graph partitioning [10], minimum cut [11], heterogeneous
networks [12] and dense subgraph mining [13]. The goal of
node clustering is to group similar nodes together
based on the linkage behaviors of a single large graph. Besides
using only linkage information, Zhou et al. [14] proposed a
random walk based approach to cluster a single static graph
by examining structural and attribute similarities. All these
techniques are only applicable to the nodes in a single static
graph; they cannot cluster many graph objects
whose nodes are drawn from a massive domain.

A number of approaches have also been proposed in the
context of object clustering, which aims to cluster many
graph objects rather than nodes from a single graph. Many
approaches have been proposed to discover the substructures
of graphs [15][13][16]. However, mining subgraphs
is computationally expensive and requires multiple passes.
Therefore, these approaches are not applicable to continuous
massive graph streams. The methods in [17][18] were proposed
to cluster objects in XML data. However, those approaches do
not scale to a massive number of graphs and nodes, and are only
able to handle disk-based data rather than stream data.
Recently, a number of techniques have been proposed to mine
graph streams. The work in [1] proposed a method to cluster
massive graph streams by extending micro-clusters. The method
in [3] constructs a summary of graph streams and classifies
graph objects by scanning each of them only once. A structural
connectivity model is proposed in [4] to identify outliers in
massive network streams. The work in [5] designed a graph
sketch technique to estimate and optimize queries on graph
streams. However, all of the above approaches only consider
the structural information among graphs and neglect the side
information associated with each graph object. Thus, many
meaningful relationships and correlations among graph objects
might not be discovered in the mining process.

Using side information to analyze record-based data
within feature spaces has been extensively studied in the
context of distance metric learning, which studies
the problem of learning proper distance metrics over inputs.
The work in [19] proposed a global distance metric learning
approach under a supervised setting. In addition, a number of
approaches, e.g. [20][21], are designed to learn local adaptive
distance metrics with supervised information. In an
unsupervised setting, Principal Component Analysis (PCA) and
Multidimensional Scaling (MDS) are widely used to reduce
dimensions using a linear strategy. A detailed survey on
distance metric learning can be found in [22]. However, these
methods cannot be easily generalized to graph data, especially
to dynamic graph streams.

3 Distance Optimization 

We first introduce some notations and definitions that will
be used throughout the paper. Assume we have a stream of
graphs $\mathcal{G}$ denoted as $\{G_1, G_2, \ldots, G_n, \ldots\}$,
where each graph $G_i$ is drawn on a subset of the massive node
set $N$. We use $\mathcal{E}$ to represent the set of distinct
edges from all graphs:
$\mathcal{E} = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_{n'}, Y_{n'}), \ldots\}$.
Specifically, $X_j$ and $Y_j$ are the two nodes of each edge
$(X_j, Y_j)$, and each graph $G_i$ contains a subset of edges
from $\mathcal{E}$. The frequency of edge $(X_j, Y_j)$ in a
graph $G_i$ is denoted by $F(X_j, Y_j, G_i)$. For example, in
communication networks, the frequency may represent the
duration of conversations between two parties. The frequency
may also be implicitly set to 1 in many applications to reflect
the link relationship between two nodes.

Associated with the graph stream, we also have $d$ different
types of side information denoted by
$\mathcal{T} = \{T_1, \ldots, T_d\}$.
For example, the side information in authorship graph
streams may contain different types of side attributes, such
as publication years, conferences and paper keywords. We
note that some attributes are associated with the whole graph,
while others are associated with individual nodes or edges.
We aggregate the side attributes of nodes and edges and
append them to the whole graph.
In addition, each graph may contain multiple side attributes
of the same type, e.g. a paper has multiple keywords. Let
$S_l = \{S_{l1}, \ldots, S_{ln}, \ldots\}$ be all distinct side
attributes of type $T_l$, where $l = 1, \ldots, d$. For
example, if the type $T_l$ is "keyword", $S_l$ stores all the
distinct keywords that have appeared in the stream. The value
of side attribute $S_{ln}$ of type $T_l$ associated with graph
$G_i$ is represented as $V(S_{ln}, G_i)$. For example,
suppose the side attribute "database" of type "keyword"
appears in the paper $G_i$ 3 times. Then its corresponding
value is 3. Clearly, $V(S_{ln}, G_i)$ is 0 if graph $G_i$ does
not contain the side attribute $S_{ln}$.

The goal of the stream clustering framework is to cluster
graph objects into $k$ clusters, which are denoted by
$C_1, C_2, \ldots, C_k$. Each incoming graph object from the
stream is dynamically assigned to the most appropriate cluster,
and the cluster is updated in real time. Suppose we have a
cluster $C_i$ containing a set of graphs
$\{G_{i_1}, \ldots, G_{i_n}\}$. The implicit graph defined by
the aggregation of graphs $\{G_{i_1}, \ldots, G_{i_n}\}$ is
denoted by $H(C_i)$. In other words, $H(C_i)$ represents the
summarization of graphs in cluster $C_i$. We use $N(C_i)$ to
denote the number of graphs in cluster $C_i$. The above
notations are summarized in Table 1.

Table 1: Notations

  Symbol                                                              Description
  $\mathcal{G} = \{G_1, \ldots, G_n, \ldots\}$                        graph stream
  $\mathcal{E} = \{(X_1, Y_1), \ldots, (X_{n'}, Y_{n'}), \ldots\}$    all distinct edges
  $\mathcal{T} = \{T_1, \ldots, T_d\}$                                $d$ types of side information with the stream
  $S_l = \{S_{l1}, \ldots, S_{ln}, \ldots\}$                          all distinct side attributes of type $T_l$, where $l = 1, \ldots, d$
  $F(X_j, Y_j, G_i)$                                                  the frequency of edge $(X_j, Y_j)$ in graph $G_i$
  $V(S_{ln}, G_i)$                                                    the value of side attribute $S_{ln}$ in type $T_l$ associated with $G_i$
  $C_1, C_2, \ldots, C_k$                                             $k$ clusters
  $H(C_i)$                                                            aggregated graphs in cluster $C_i$
  $N(C_i)$                                                            the number of graphs in cluster $C_i$

3.1 Preprocessing We propose a general framework to
cluster both directed and undirected graphs. The edges can
be either weighted or unweighted. The side information can
also be of different formats. The notations used in the main
paper implicitly assume each graph object in the stream is a
directed graph. For undirected graphs, we convert them to
directed graphs by applying lexicographic ordering on node
labels. Thus, all notations can be simply reused for the case
of undirected graphs after the conversion. Meanwhile,
we set the frequencies of all edges to 1 if no frequency
information is provided in the graphs.

We consider the general case in which each side attribute is
numeric. Binary and categorical attributes can be converted
to numeric attributes in a straightforward way. Specifically,
binary attributes are special cases of numeric attributes. In
addition, for categorical attributes, different categorical
values can be treated as separate binary attributes. For side
attributes that are associated with individual nodes or edges,
we compute the aggregated values for each whole graph.

3.2 Distance Definitions In order to cluster graph objects
into a set of $k$ clusters such that similar graphs are grouped
into the same clusters, a distance function is required to
measure the similarities between graphs and clusters. Suppose
we have a newly arrived graph $G_i$. For each cluster $C_j$,
where $j = 1, 2, \ldots, k$, the distance between $G_i$ and
$C_j$ is calculated by a distance function $d(G_i, C_j)$. The
new graph $G_i$ is grouped into its nearest cluster, i.e. the
one with the minimum distance among all $k$ clusters. We note
that each new graph contains both edge information and side
information. Therefore, we define two types of distances, for
edge and side information respectively. The quadratic edge
distance between the graph $G_i$ and cluster $C_j$ is defined as:

(3.1)  $d_e^2(G_i, C_j) = \sum_{t=1}^{m} \left( F(X_t, Y_t, G_i) - \frac{F(X_t, Y_t, H(C_j))}{N(C_j)} \right)^2$

where $m$ is the number of distinct edges received. In the
above definition, all edges are enumerated and summed into
the distance measure. However, we notice that only the edges
contained in $G_i$ or $H(C_j)$ contribute to the summation.
Since $H(C_j)$ is the summarization of all graphs in the
$j$th cluster, the frequencies of edges $F(X_t, Y_t, H(C_j))$
are normalized by the number of graphs in the cluster.
Similarly, the quadratic side distance between the graph $G_i$
and cluster $C_j$ on the side information of type $T_l$ is
defined as:

(3.2)  $d_s^2(G_i, C_j, T_l) = \sum_{t=1}^{m_l} \left( V(S_{lt}, G_i) - \frac{V(S_{lt}, H(C_j))}{N(C_j)} \right)^2$

where $m_l$ is the number of distinct side attributes of type $T_l$.
Let vector $D(G_i, C_j) = [d_e(G_i, C_j), d_s(G_i, C_j, T_1), \ldots, d_s(G_i, C_j, T_d)]^T$.
The dimension of $D(G_i, C_j)$ is $d + 1$.
Given the definitions of edge distance and side distances,
we can further define the E-S distance on edge and side
information:

Definition 1. (E-S Distance) The E-S distance on edge
and side information is defined as:

$d^2(G_i, C_j) = \|G_i - C_j\|_A^2$

$\quad = w_0 \cdot d_e^2(G_i, C_j) + \sum_{l=1}^{d} \left( w_l \cdot d_s^2(G_i, C_j, T_l) \right)$

(3.3)  $\quad = D(G_i, C_j)^T A \, D(G_i, C_j), \quad A \succeq 0$

where $w_0$ is the weight of the edge distance and
$w_l$ ($l = 1, \ldots, d$) are the weights of the side distances.
Matrix $A$ is a diagonal matrix
$\mathrm{diag}(w_0, w_1, \ldots, w_d)$ representing the weights
of edge and side distances.

In the above definition, matrix $A$ is required to be positive
semi-definite ($A \succeq 0$) to ensure that the E-S distance
is non-negative.
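
To make Definition 1 concrete, the following Python sketch computes the quadratic edge and side distances of Eq. (3.1) and Eq. (3.2) and combines them with the diagonal weights of A as in Eq. (3.3). It is a minimal illustration under stated assumptions: the cluster summary H(C_j) is held in exact dictionaries rather than in sketches, and the helper names `quadratic_distance` and `es_distance` are hypothetical, not taken from the paper.

```python
from typing import Dict, Hashable, List


def quadratic_distance(graph_vals: Dict[Hashable, float],
                       cluster_sums: Dict[Hashable, float],
                       n_graphs: int) -> float:
    """Sum over keys of (graph value - cluster sum / N(C_j))^2.

    Keys missing on either side count as 0, so only entries present in
    the graph or in the cluster summary contribute.
    """
    keys = set(graph_vals) | set(cluster_sums)
    return sum(
        (graph_vals.get(k, 0.0) - cluster_sums.get(k, 0.0) / n_graphs) ** 2
        for k in keys
    )


def es_distance(graph_edges: Dict[Hashable, float],
                graph_side: List[Dict[Hashable, float]],
                cluster_edges: Dict[Hashable, float],
                cluster_side: List[Dict[Hashable, float]],
                n_graphs: int,
                weights: List[float]) -> float:
    """E-S distance with weights = [w_0, w_1, ..., w_d] (diagonal of A)."""
    parts = [quadratic_distance(graph_edges, cluster_edges, n_graphs)]
    for g_attrs, c_attrs in zip(graph_side, cluster_side):
        parts.append(quadratic_distance(g_attrs, c_attrs, n_graphs))
    return sum(w * p for w, p in zip(weights, parts))


# Hypothetical example: one side attribute type ("keyword"), a cluster of 2 graphs.
new_edges = {("a", "b"): 1.0}
new_side = [{"database": 3.0}]
cluster_edge_sums = {("a", "b"): 2.0, ("b", "c"): 2.0}
cluster_side_sums = [{"database": 2.0}]
dist = es_distance(new_edges, new_side, cluster_edge_sums,
                   cluster_side_sums, n_graphs=2, weights=[1.0, 0.5])
```

In this example the edge distance is 1 (the edge ("b", "c") is absent from the new graph) and the keyword distance is 4, so the weighted total is 1.0 * 1 + 0.5 * 4 = 3.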

3.3 Dynamic Multi-distance Optimization (DMO) A
straightforward E-S distance measure uses the
$(d+1)$-dimensional identity matrix
$I_{d+1} = \mathrm{diag}(1, 1, \ldots, 1)$ as the
matrix $A$. Thus, the edge distance and all side distances are
assigned equal weights of 1. However, different types of
attributes and link information give different indications for
the clustering. Suppose we cluster authorship graphs according
to research areas. The coauthor relationships and paper
keywords are clearly more important than author affiliations
and publication years. The reasons are that researchers from
the same area tend to collaborate, and papers of the same
area are more likely to share the same keywords. Therefore,
assigning equal or manually predefined weights to the E-S
distance does not generalize to the wide range of real-world
applications on massive graph streams.

In order to dynamically learn the weights of distances
in matrix $A$ with the progression of the stream, we consider
minimizing the intra-cluster distances of graphs received
so far:

(3.4)  $\min_{A} \sum_{j=1}^{k} \sum_{G_i \in C_j} \|G_i - C_j\|_A^2$

A trivial solution of this optimization problem is $A = 0$.
Thus, we further add a series of constraints to regulate the
pairwise inter-cluster distances between cluster centroids:

(3.5)  $\|C_i - C_j\|_A \ge c$, for $i, j = 1, \ldots, k$ and $i \ne j$

Here, the definition of the inter-cluster distance
$\|C_i - C_j\|_A$ is a natural extension of Eq. (3.3), and $c$
in Eq. (3.5) is an arbitrary positive constant which only
affects the scale of the weights. Thus we set $c$ to 1. The
optimization framework is given below:

$\min_{A} \sum_{j=1}^{k} \sum_{G_i \in C_j} \|G_i - C_j\|_A^2$

s.t.  $\|C_i - C_j\|_A \ge 1$  ($i, j = 1, \ldots, k$ and $i \ne j$)

(3.6)  $A$ is diagonal, $A \succeq 0$

The idea of the optimization is to make the graphs within the
same cluster as coherent as possible while keeping graphs from
different clusters well separated.

LEMMA 3.1. The proposed optimization framework in
Eq. (3.6) is a convex optimization problem.

Proof. From the definition in Eq. (3.3), it is clear that the
objective function is a linear function of $A$. Thus the
objective function is convex. The inter-cluster distance
constraints can be rewritten as
$1 - \left( \|C_i - C_j\|_A^2 \right)^{1/2} \le 0$
for $i, j = 1, \ldots, k$ and $i \ne j$. Since
$\left( \|C_i - C_j\|_A^2 \right)^{1/2}$ is the square root of
a function that is linear in $A$, it is concave, and so the
inter-cluster distance constraints are convex functions. It is
also straightforward to verify that the constraints on
matrix $A$ are convex [23]. Thus, the optimization framework
in Eq. (3.6) is a convex optimization problem.

We propose Dynamic Multi-distance Optimization 
(DMO) to solve the above optimization framework, and we 
use DMO to dynamically refine the weights of graph edges 
and side attributes. 

LEMMA 3.2. (DMO) The solution of the proposed optimization
framework in Eq. (3.6) can be approximated by solving
the following form:

(3.7)  $\min_{A} \; t \sum_{j=1}^{k} \sum_{G_i \in C_j} \|G_i - C_j\|_A^2 - \sum_{i=1}^{k} \sum_{j \ne i} \log\left( \|C_i - C_j\|_A - 1 \right), \quad t > 0, \; A \succeq 0$

Proof. Since the inter-cluster distance inequality constraints
can be rewritten as
$1 - \left( \|C_i - C_j\|_A^2 \right)^{1/2} \le 0$, we define
the log-barrier of the problem as:

(3.8)  $\phi(A) = - \sum_{i=1}^{k} \sum_{j \ne i} \log\left( \|C_i - C_j\|_A - 1 \right)$

Eq. (3.7) can be directly derived by applying the log-barrier to
the objective function in Eq. (3.6) [23]. Here, $t$ is a positive
parameter of the logarithmic barrier method.

1 The $L_2$-distance is not used here, to prevent matrix $A$ from always being rank 1.

We describe the details of efficient distance estimation
in Section 4. Supposing the distances on edges and side
information are available, Eq. (3.7) in DMO can be solved by
using the gradient descent algorithm. Specifically, the matrix
$A$ is initialized to the identity matrix, which gives all
distances equal weights. For every $\gamma$ graphs clustered,
a gradient descent search is applied to Eq. (3.7) and the
weights are dynamically optimized based on the newly received
graph edges and side attributes. With DMO enabled, the weights
are adjusted to reduce the intra-cluster distances while
keeping the clusters separated. Thus, the weights of
both edge and side information can be gradually and
dynamically refined throughout the stream.
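
The update can be illustrated with a small Python sketch of one gradient descent step on the barrier objective of Eq. (3.7) for a diagonal A. This is only a sketch under several assumptions that are not in the paper: `intra[l]` holds the summed squared component distances of all clustered graphs to their centroids on component l, `inter[p][l]` holds the squared component distance between the p-th pair of centroids, the step size is halved until the objective decreases (a simple backtracking rule), and the names `barrier_objective` and `dmo_step` are hypothetical.

```python
import math
from typing import List


def barrier_objective(w: List[float], intra: List[float],
                      inter: List[List[float]], t: float) -> float:
    """t * sum_l w_l * intra_l - sum_p log(||C_i - C_j||_A - 1).

    Returns infinity when a weight is negative or a pair of centroids
    violates the constraint ||C_i - C_j||_A > 1.
    """
    if any(x < 0 for x in w):
        return math.inf
    value = t * sum(wl * al for wl, al in zip(w, intra))
    for b in inter:
        s = math.sqrt(sum(wl * bl for wl, bl in zip(w, b)))
        if s <= 1.0:
            return math.inf
        value -= math.log(s - 1.0)
    return value


def dmo_step(w: List[float], intra: List[float], inter: List[List[float]],
             t: float = 1.0, step: float = 0.1) -> List[float]:
    """One projected gradient step; w must be strictly feasible."""
    current = barrier_objective(w, intra, inter, t)
    if math.isinf(current):
        raise ValueError("starting weights are not strictly feasible")
    grad = [t * al for al in intra]
    for b in inter:
        s = math.sqrt(sum(wl * bl for wl, bl in zip(w, b)))
        for l, bl in enumerate(b):
            # d/dw_l of -log(s - 1), with s = sqrt(sum_l w_l * b_l)
            grad[l] -= bl / (2.0 * s * (s - 1.0))
    while step > 1e-12:
        cand = [max(0.0, wl - step * gl) for wl, gl in zip(w, grad)]
        if barrier_objective(cand, intra, inter, t) < current:
            return cand
        step /= 2.0  # backtrack: the candidate was infeasible or no better
    return list(w)


# Hypothetical data: component 1 is much noisier inside clusters than component 0.
intra = [1.0, 10.0]
inter = [[9.0, 9.0]]
weights = [1.0, 1.0]  # identity initialization
for _ in range(50):
    weights = dmo_step(weights, intra, inter)
```

With these numbers the noisier component ends up with the smaller weight, which is the behaviour the text describes. How many search steps to run per update is a tuning choice left open here.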

4 Sketch-Based Clustering Framework 

One challenge of stream mining is the growing size of the
available data. This problem is especially critical for graph
data with side information. On the one hand, graphs are
drawn from a massive set of nodes in many real applications,
and the number of possible edges is quadratic in
the number of nodes. On the other hand, the volume of side
information can also be quite large. Furthermore, the
sizes of both edge and side information grow as more
and more data is received. When the sizes become extremely
large, it becomes very difficult to maintain all data
in memory. In this section, we propose a carefully
designed sketch-based framework to maintain the statistics of
incoming data. The proposed framework considerably
reduces the storage requirement and only requires constant
memory as the stream progresses. We also demonstrate how
to use the maintained statistics to accurately estimate the key
measures in the clustering process as well as in the
optimization framework DMO.

4.1 Preliminaries Sketch approaches are generic methods
to approximate aggregation functions in the data stream
domain. We adapt the Count-Min sketch [24] to estimate
frequency statistics of data points, and extend it to the
context of graphs with side information. In each sketch table,
we maintain a two-dimensional array of $w \cdot h$ cells with
$w = \lceil \ln(1/\delta) \rceil$ rows and
$h = \lceil e/\epsilon \rceil$ columns, where $e$ is the
base of the natural logarithm. In addition, $w$ different hash
functions $f_1, \ldots, f_w$ are randomly generated from a
pairwise-independent family. Each hash function corresponds to
one of the one-dimensional row arrays with $h$ cells in the
sketch. When a new data point $d_i$ arrives, each hash function
$f_j$ is applied to $d_i$ and maps it to a hash value $v_j$ in
the range $[0, h-1]$. For the $j$th hash function, the
frequency of data point $d_i$ is added to the $v_j$th column
on the $j$th row of the sketch. Thus, only one cell on each row
is updated, and $w$ cells in the sketch table are incremented
by the frequency of $d_i$.

2 Since the weights are refined gradually, every update takes only a few
search steps. We do not use the Newton method because it is usually slower
due to the matrix inverse at each update.

In order to estimate the frequency of a data point, we
map the data point to $w$ cells in the sketch table by applying
the $w$ hash functions. The frequency of the data point is
estimated by the minimum value among these $w$ cells.
We notice that the sketch table can only overestimate the
actual values, since the frequencies are non-negative and
cells are updated by addition. As shown in [24], the
overestimate is no more than $\epsilon \cdot T$ with
probability at least $1 - \delta$ for a data stream with $T$
arrivals. This probabilistic upper bound shows that increasing
$w$ and $h$ yields a more accurate estimate. Its sensitivity to
the size of the sketch table has been studied in previous work
HGU. In the following, we will present how to apply 
sketches to statistics maintenance on graph streams with side 
information. 
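
As an illustration of the update and query procedure described above, here is a minimal Count-Min sketch in Python. It is a sketch under assumptions rather than the implementation used in the paper: the class name `CountMinSketch` is hypothetical, and a salted BLAKE2b digest stands in for hash functions drawn from a pairwise-independent family.

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.log(1.0 / delta))  # number of rows / hash functions
        self.h = math.ceil(math.e / epsilon)       # number of columns
        self.table = [[0.0] * self.h for _ in range(self.w)]

    def _column(self, row: int, key: str) -> int:
        # Assumption: a per-row salt emulates w different hash functions.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.h

    def add(self, key: str, freq: float = 1.0) -> None:
        """Increment exactly one cell per row by the given frequency."""
        for j in range(self.w):
            self.table[j][self._column(j, key)] += freq

    def estimate(self, key: str) -> float:
        """Minimum over the w cells; never below the true frequency."""
        return min(self.table[j][self._column(j, key)] for j in range(self.w))


# Usage: epsilon = delta = 0.01 gives a table of 5 rows and 272 columns.
sketch = CountMinSketch(epsilon=0.01, delta=0.01)
sketch.add("a|b", 3.0)
sketch.add("a|b", 2.0)
sketch.add("c|d", 1.0)
estimate = sketch.estimate("a|b")  # at least 5.0, the true frequency
```

Because all updates are non-negative additions, a collision can only inflate a cell, which is why taking the minimum over rows gives the tightest available estimate.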

4.2 Sketch Based Statistics Instead of storing the explicit 
edges and side information, we maintain the following statis- 
tics: 

DEFINITION 2. The Statistics of Graphs with Side information
SGS(C) maintained in memory for each cluster $C$
are defined as $\{ESketch(C), ER(C), SSketch(C, 1 \ldots d),
SR(C, 1 \ldots d), N(C), T(C)\}$. Each component in SGS(C)
is defined in detail as:

• $ESketch(C)$: one $w \cdot h$ sketch table storing first
moments of edge frequencies.

• $ER(C)$: the summation of second moments of edge
frequencies: $ER(C) = \sum_{G_i \in C} \sum_{t=1}^{m} F^2(X_t, Y_t, G_i)$.

• $SSketch(C, 1 \ldots d)$: $d$ sketch tables of size $w \cdot h$
storing first moment values for the $d$ side attribute types
correspondingly.

• $SR(C, 1 \ldots d)$: a vector of length $d$ containing the
summation of second moments of side attribute values:
$SR(C, l) = \sum_{G_i \in C} \sum_{t=1}^{m_l} V^2(S_{lt}, G_i)$, $l = 1, \ldots, d$.

• $N(C)$: the number of graphs in the cluster $C$.

• $T(C)$: the most recent timestamp at which the cluster was
updated.

When a new incoming graph $G_t$ is assigned to a cluster
$C$, the statistics in SGS(C) are updated as follows. For each
edge $(X_i, Y_i)$, the $w$ hash functions are applied to
$X_i \oplus Y_i$ and the hash values are used to determine $w$
cells in the sketch table $ESketch(C)$. Here, $\oplus$ is the
concatenation operator on the node label strings. Those $w$
cells are incremented by $F(X_i, Y_i, G_t)$. Meanwhile, the
second moment of the edge frequency is added to $ER(C)$.
Similarly, each side attribute in graph $G_t$ is hashed into
$SSketch(C, 1 \ldots d)$, and $SR(C, 1 \ldots d)$ is updated
based on the second moment value. Lastly, $N(C)$ is incremented
by 1 and $T(C)$ is updated to the current time.

Since none of the components of SGS(C) grows in size
during an update, the maintained statistics occupy constant
storage as the stream progresses. Another advantage of SGS(C)
is that its storage can be easily adjusted by setting the
sizes of the sketches to fit the local hardware constraints. We
further observe that SGS(C) follows the additive property:

LEMMA 4.1. The statistics maintained in Definition 2 follow
the additive property. In other words, $SGS(C_1 \cup C_2)$
can be computed as a function of $SGS(C_1)$ and $SGS(C_2)$.

Proof. The sketch table in $ESketch(C_1 \cup C_2)$ can
be computed by adding the two-dimensional arrays
in $ESketch(C_1)$ and $ESketch(C_2)$. Similarly,
$SSketch(C_1 \cup C_2, 1 \ldots d)$ is the summation of the
sketch tables $SSketch(C_1, 1 \ldots d)$ and
$SSketch(C_2, 1 \ldots d)$. For second moments,

$ER(C_1 \cup C_2) = \sum_{G_i \in (C_1 \cup C_2)} \sum_{t=1}^{m} F^2(X_t, Y_t, G_i)$

$\quad = \sum_{G_i \in C_1} \sum_{t=1}^{m} F^2(X_t, Y_t, G_i) + \sum_{G_i \in C_2} \sum_{t=1}^{m} F^2(X_t, Y_t, G_i)$

$\quad = ER(C_1) + ER(C_2)$

Likewise, $SR(C_1 \cup C_2, 1 \ldots d) = SR(C_1, 1 \ldots d) +
SR(C_2, 1 \ldots d)$. $N(C_1 \cup C_2)$ is the total number of
graphs in $C_1$ and $C_2$, thus
$N(C_1 \cup C_2) = N(C_1) + N(C_2)$.
$T(C_1 \cup C_2)$ is the most recent timestamp of $C_1$ and
$C_2$, hence $T(C_1 \cup C_2) = \max(T(C_1), T(C_2))$.
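
The update rule and the additive property of Lemma 4.1 can be checked with a small Python sketch. It is illustrative only and rests on assumptions not stated in the paper: the sketch size `W` x `H` is fixed arbitrarily, a salted BLAKE2b digest stands in for the pairwise-independent hash functions, node labels are joined with a "|" separator as the concatenation operator, and the names `SGS`, `cells` and `merge` are hypothetical.

```python
import hashlib

W, H = 4, 64  # assumed sketch size: W rows (hash functions), H columns


def cells(key: str):
    """Yield one (row, column) cell per hash function for the key."""
    for row in range(W):
        d = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                            salt=bytes([row])).digest()
        yield row, int.from_bytes(d, "little") % H


def new_table():
    return [[0.0] * H for _ in range(W)]


class SGS:
    """Per-cluster statistics: sketches, second moments, count, timestamp."""

    def __init__(self, d: int):
        self.esketch = new_table()                      # ESketch(C)
        self.er = 0.0                                   # ER(C)
        self.ssketch = [new_table() for _ in range(d)]  # SSketch(C, 1..d)
        self.sr = [0.0] * d                             # SR(C, 1..d)
        self.n = 0                                      # N(C)
        self.t = 0.0                                    # T(C)

    def update(self, edges, side, timestamp):
        for (x, y), f in edges.items():
            for r, c in cells(x + "|" + y):
                self.esketch[r][c] += f
            self.er += f * f
        for l, attrs in enumerate(side):
            for name, v in attrs.items():
                for r, c in cells(name):
                    self.ssketch[l][r][c] += v
                self.sr[l] += v * v
        self.n += 1
        self.t = timestamp


def add_tables(p, q):
    return [[x + y for x, y in zip(rp, rq)] for rp, rq in zip(p, q)]


def merge(a: SGS, b: SGS) -> SGS:
    """SGS of the union of two disjoint clusters, computed component-wise."""
    out = SGS(len(a.sr))
    out.esketch = add_tables(a.esketch, b.esketch)
    out.er = a.er + b.er
    out.ssketch = [add_tables(p, q) for p, q in zip(a.ssketch, b.ssketch)]
    out.sr = [x + y for x, y in zip(a.sr, b.sr)]
    out.n = a.n + b.n
    out.t = max(a.t, b.t)
    return out
```

Note that no component grows with the number of graphs inserted: every update touches a fixed number of cells and a few scalars, which is what gives the constant storage footprint.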

4.3 Algorithm with Side Information (GSSClu) Here,
we present the algorithm for clustering graph streams with
side information. The input of the clustering algorithm is the
number of clusters $k$. The only information we store is the
set of cluster statistics $\{SGS(C_1), \ldots, SGS(C_k)\}$. As
the initialization step, we set $A$ to the identity matrix,
which gives equal weights to the edge and side information. For
the first $k$ received graphs, we create $k$ singleton cluster
statistics $SGS(C_i)$, $i = 1, \ldots, k$, respectively. While
the initialization may not create a well-separated clustering,
these $k$ clusters are further stabilized in the subsequent
steps. For each new graph $G_i$, we compute the E-S distance on
edge and side information between $G_i$ and the $k$ clusters.
Assume $C_{min}$ is the closest cluster to $G_i$ among all $k$
clusters. We also want to measure the structural spread of the
cluster $C_{min}$, since $G_i$ may not necessarily belong to
cluster $C_{min}$. The reason is that $G_i$ might be an outlier
or represent a new cluster, even though it has the shortest
distance to $C_{min}$ compared with the other clusters. Thus,
we define the structural spread of a given cluster $C_j$ as a
function of the mean square radius of $C_j$:

(4.9)  $S(C_j) = \rho \cdot \frac{\sum_{G_i \in C_j} \|G_i - C_j\|_A^2}{N(C_j)}$

Here, the spread $S(C_j)$ is defined as the mean square radius
of the cluster $C_j$ multiplied by a factor $\rho$. If the
graph $G_i$ is within the spread of $C_{min}$, $G_i$ is
assigned to cluster $C_{min}$ and the statistics
$SGS(C_{min})$ are updated accordingly.
Otherwise, the graph $G_i$ may be an outlier or represent a
newly born cluster. Therefore, we remove the most stale
cluster, namely the least recently updated one according to the
stored timestamps, and create singleton cluster statistics
from $G_i$. Meanwhile, for every $\gamma$ graphs obtained from
the stream, we dynamically optimize the matrix $A$ based on
newly received information using Eq. (3.7) defined in DMO.
Thus, the weights of both edge and side information can be
actively learned and adjusted with the evolving stream. The
detailed description of the clustering method GSSClu can be
found in Algorithm 1.
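
The assignment decision described above can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the distances, spreads and timestamps are passed in as plain lists rather than estimated from SGS(C), the function names `spread` and `choose_cluster` are hypothetical, and the default `rho = 3.0` is only a placeholder value. The rule is applied literally, so a singleton cluster, whose spread is zero, never absorbs a new graph; how the paper treats that case is not specified in this section.

```python
from typing import List, Tuple


def spread(intra_sq_sum: float, n_graphs: int, rho: float = 3.0) -> float:
    """Mean square radius of a cluster multiplied by the factor rho."""
    return rho * intra_sq_sum / n_graphs


def choose_cluster(distances: List[float], spreads: List[float],
                   timestamps: List[float]) -> Tuple[str, int]:
    """Return ("assign", j) for the nearest cluster if the graph falls
    within its spread, otherwise ("replace", j) for the stalest cluster."""
    nearest = min(range(len(distances)), key=distances.__getitem__)
    if distances[nearest] < spreads[nearest]:
        return "assign", nearest
    stalest = min(range(len(timestamps)), key=timestamps.__getitem__)
    return "replace", stalest


# Hypothetical numbers for k = 3 clusters.
distances = [4.0, 1.0, 9.0]    # squared E-S distances to each cluster
timestamps = [5.0, 7.0, 2.0]   # last update time of each cluster
wide = [spread(2.0, 4), spread(2.0, 4), spread(2.0, 4)]  # 1.5 each
tight = [0.5, 0.5, 0.5]
decision_wide = choose_cluster(distances, wide, timestamps)
decision_tight = choose_cluster(distances, tight, timestamps)
```

With the wider spreads the graph joins cluster 1 (its nearest); with the tighter spreads it is treated as a potential new cluster and replaces cluster 2, which has the oldest timestamp.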

4.4 Key Measures Estimation Next, we will illustrate 
how to compute the measures in GSSClu using the statistics 
maintained by SGS(C). 

LEMMA 4.2. The statistics maintained in SGS(Cj), j = 
1, k are sufficient to compute all measures required by 
the clustering algorithm GSSClu. 

Proof. From Algorithm [TJ it is clear that the clustering 
process requires the following measures: 

• \\Gi—Cj \\\ (Eq. [33T i: the E-S distance between a newly 
received graph Gi and a cluster Cj . 



Algorithm 1: Clustering Graph Streams with Side
Information (GSSClu)

Input: k: number of clusters

Initialize cluster statistics set to be an empty set;
$A = I_{d+1}$;
graph_count = 0;
foreach newly received graph $G_i$ do
    graph_count = graph_count + 1;
    if $i \le k$ then
        create singleton cluster statistics $SGS(C_i)$
            by inserting $G_i$;
        continue;
    end
    for j = 1 to k do
        compute $\|G_i - C_j\|_A^2$ defined in Eq. 3.3;
    end
    let $C_{min}$ be the closest cluster;
    if $\|G_i - C_{min}\|_A^2 < S(C_{min})$ then
        assign $G_i$ to $SGS(C_{min})$;
    else
        replace least recently updated cluster statistics
            by singleton cluster statistics created from $G_i$;
    end
    if graph_count % $\gamma$ == 0 then
        adjust $A$ by optimizing Eq. 3.7;
    end
end



• $\sum_{G_i \in C_j} \|G_i - C_j\|_A^2$: the intra-cluster
distance of a cluster $C_j$, where $G_i$ ranges over all graphs
clustered in $C_j$.

• $\|C_i - C_j\|_A^2$ (Eq. 3.5): the inter-cluster distance
between two clusters.

• $S(C_j)$ (Eq. 4.9): the structural spread of a cluster $C_j$.

All four of these distance measures are computed from
combinations of side information distances and the edge
distance. Due to space limitations, we only show how to
compute the measures related to the side information. The
computation in terms of the edges can be derived in a similar
way.

¹ We use $\rho = 3$ in accordance with the normal distribution assumption.

E-S Distance of New Graphs: The distance between an
incoming graph $G_i$ and a cluster $C_j$ on the side information
of type $T_l$ is defined in Eq. 3.2. It can be expanded as:

(4.10)
$d_s^2(G_i, C_j, T_l) = \sum_{t=1}^{m_l} \left( V(S_{lt}, G_i) - \frac{V(S_{lt}, H(C_j))}{N(C_j)} \right)^2$
$= \sum_{t=1}^{m_l} V^2(S_{lt}, G_i) - 2 \sum_{t=1}^{m_l} \frac{V(S_{lt}, G_i) \, V(S_{lt}, H(C_j))}{N(C_j)} + \sum_{t=1}^{m_l} \frac{V^2(S_{lt}, H(C_j))}{N^2(C_j)}$

Since $G_i$ is the newly received graph from the stream, its
side attributes are available and known exactly. Therefore,
the first term $\sum_t V^2(S_{lt}, G_i)$ in Eq. 4.10 can be
computed exactly. In the second term, only attributes for
which both $V(S_{lt}, G_i)$ and $V(S_{lt}, H(C_j))$ are non-zero
contribute to the summation. Hence, instead of enumerating
all $m_l$ side attributes, we only need to enumerate the side
attributes contained in $G_i$. $V(S_{lt}, H(C_j))$ can be
directly estimated from the sketch table $SSketch(C_j, l)$ in
$SGS(C_j)$, whereas the exact value of $V(S_{lt}, G_i)$ is
known. $N(C_j)$ is also stored in $SGS(C_j)$. The third term
can be computed by performing pairwise self products of each
row in $SSketch(C_j, l)$. The minimum of these $w$ values
divided by $N^2(C_j)$ is used as the estimate.
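
The three-term estimate described above can be illustrated with a small count-min style table. The sketch below is an illustration under stated assumptions, not the authors' code: the class name `SideSketch`, the SHA-256 based hashing, and the function `es_side_distance` are hypothetical, and attributes are identified by string keys. Only the idea of taking the minimum over the w rows comes from the text.

```python
import hashlib

class SideSketch:
    """Count-min style table with w hash rows and h columns (illustrative)."""
    def __init__(self, w=10, h=500):
        self.w, self.h = w, h
        self.table = [[0.0] * h for _ in range(w)]

    def _col(self, row, key):
        # One salted hash per row; any pairwise-independent family would do.
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.h

    def add(self, attrs):
        """Fold one graph's attribute values {attribute: value} into the table."""
        for key, val in attrs.items():
            for r in range(self.w):
                self.table[r][self._col(r, key)] += val

    def estimate(self, key):
        """Estimate of the aggregated value of one attribute (minimum over rows)."""
        return min(self.table[r][self._col(r, key)] for r in range(self.w))

    def self_product(self):
        """Estimate of the sum of squared aggregated values (minimum over rows)."""
        return min(sum(c * c for c in row) for row in self.table)

def es_side_distance(attrs, sketch, n):
    """Estimated squared side distance between a graph and a cluster centroid.

    attrs:  exact attribute values of the incoming graph
    sketch: SideSketch holding the aggregated values of the cluster
    n:      number of graphs in the cluster
    """
    first = sum(v * v for v in attrs.values())                      # exact
    second = sum(v * sketch.estimate(a) for a, v in attrs.items()) / n
    third = sketch.self_product() / (n * n)
    return first - 2.0 * second + third
```

Note that the second term only loops over the attributes present in the incoming graph, which is what keeps the per-graph cost independent of the total number of distinct attributes.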

Intra-cluster Distance: The intra-cluster distance of
cluster $C_j$ is defined as the sum of distances between every
graph clustered in $C_j$ and the centroid of $C_j$. Unlike the
previous computation in Eq. 4.10, one should note that the
graphs clustered in $C_j$ are not explicitly stored. Hence, the
estimation from Eq. 4.10 cannot be directly reused. For the
side information of type $T_l$, it can be expanded as:



(4.11)  $\sum_{G_i \in C_j} d_s^2(G_i, C_j, T_l) = SR(C_j, l) - N(C_j) \sum_{t=1}^{m_l} \frac{V^2(S_{lt}, H(C_j))}{N^2(C_j)}$



From Eq. 4.11, one can observe that $SR(C_j, l)$ and
$N(C_j)$ are both explicitly maintained in $SGS(C_j)$.
$\sum_{t=1}^{m_l} V^2(S_{lt}, H(C_j))$ in the second term can be
estimated by pairwise products of each row in
$SSketch(C_j, l)$. Thus, the intra-cluster distance can also be
computed from the statistics maintained.
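
The reason additive statistics suffice here is the standard identity that the sum of squared distances to the centroid equals the sum of squared norms minus the squared norm of the vector sum divided by N. The short check below verifies this numerically. It is only a sketch: it assumes that the cluster keeps the total of squared attribute values (which we take to be the role of SR, an assumption), and the function name `intra_cluster_from_stats` is hypothetical.

```python
import numpy as np

def intra_cluster_from_stats(sr, h, n):
    """Sum of squared distances to the centroid from additive statistics.

    sr: sum over all graphs of their squared attribute values (assumed SR)
    h:  attribute-wise sum over all graphs, i.e. the aggregated vector H(C)
    n:  number of graphs in the cluster, N(C)
    """
    return sr - float(np.dot(h, h)) / n

# Six toy graphs with five side attributes each.
rng = np.random.default_rng(0)
graphs = rng.integers(0, 4, size=(6, 5)).astype(float)

# Direct computation, which needs every graph to be stored.
centroid = graphs.mean(axis=0)
direct = float(np.sum((graphs - centroid) ** 2))

# Computation from the three stored statistics only.
from_stats = intra_cluster_from_stats(
    float(np.sum(graphs ** 2)), graphs.sum(axis=0), len(graphs)
)
```

Both `direct` and `from_stats` agree up to floating-point error, while the second form never revisits individual graphs.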

Inter-cluster Distance: The inter-cluster distance is
defined as the distance between the centroids of two clusters
in Eq. 3.5. The inter-cluster distance in terms of side
information type $T_l$ is:

(4.12)
$d_s^2(C_i, C_j, T_l) = \sum_{t=1}^{m_l} \left( \frac{V(S_{lt}, H(C_i))}{N(C_i)} - \frac{V(S_{lt}, H(C_j))}{N(C_j)} \right)^2$
$= \sum_{t=1}^{m_l} \frac{V^2(S_{lt}, H(C_i))}{N^2(C_i)} - 2 \sum_{t=1}^{m_l} \frac{V(S_{lt}, H(C_i)) \, V(S_{lt}, H(C_j))}{N(C_i) N(C_j)} + \sum_{t=1}^{m_l} \frac{V^2(S_{lt}, H(C_j))}{N^2(C_j)}$

As shown previously, the first and third terms can be
computed by pairwise self products of sketches in $SGS(C_i)$
and $SGS(C_j)$ respectively. Similarly, the second term can be
computed by the product of each row from $SSketch(C_i, l)$
and $SSketch(C_j, l)$, and the minimum over the $w$ rows is
used as the estimate.

Cluster Structural Spread: From the definition in
Eq. 4.9, the spread of a cluster $C_j$ related to the side
information type $T_l$ can be represented as:

(4.13)  $\frac{\rho}{N(C_j)} \sum_{G_i \in C_j} d_s^2(G_i, C_j, T_l)$

Since $\sum_{G_i \in C_j} d_s^2(G_i, C_j, T_l)$ can be estimated
from Eq. 4.11, the structural spread can also be computed from
the statistics.

Therefore, all measures used in Algorithm 1, including
DMO, can be estimated from the statistics maintained in
$SGS(C_j)$, $j = 1, \ldots, k$. We further observe that the
accuracy of these estimates depends directly on the sketches,
whose error is bounded by the probabilistic upper bound
described earlier.
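
To make the inter-cluster case concrete, the following sketch estimates the squared centroid distance of two clusters from their sketch tables alone. It is an illustration under stated assumptions: both tables must share the same hash functions, the table sizes follow the settings reported in Section 5.3, and the names `column`, `build_sketch` and `inter_cluster_distance` are hypothetical rather than taken from the paper.

```python
import hashlib
import numpy as np

W, H = 10, 500  # number of hash rows and table width (assumed defaults)

def column(row, key):
    """Column of `key` in hash row `row`; shared by all cluster sketches."""
    digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
    return int(digest, 16) % H

def build_sketch(graphs):
    """Aggregate the attribute values of all graphs of one cluster."""
    table = np.zeros((W, H))
    for attrs in graphs:
        for key, val in attrs.items():
            for r in range(W):
                table[r, column(r, key)] += val
    return table

def inter_cluster_distance(sk_i, n_i, sk_j, n_j):
    """Estimated squared distance between two centroids on one side type."""
    # Row-wise products, then the minimum over the W rows as the estimate.
    first = np.min(np.sum(sk_i * sk_i, axis=1)) / (n_i * n_i)
    cross = np.min(np.sum(sk_i * sk_j, axis=1)) / (n_i * n_j)
    third = np.min(np.sum(sk_j * sk_j, axis=1)) / (n_j * n_j)
    return float(first - 2.0 * cross + third)
```

Sharing the hash functions across clusters is what makes the row-wise cross product meaningful: the same attribute then lands in the same cell of both tables.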

5 Experimental Results 

In this section, we evaluate the effectiveness and efficiency
of the proposed clustering scheme against a number of
baselines on real data sets. We refer to our approach as the
GSSClu method, since it is designed for Graph Stream with
Side Information Clustering.

5.1 Data Sets We use two real data sets, namely CORA
and IMDB, to evaluate the GSSClu method. The details of
these two data sets are as follows:



• CORA Data Set: The first data set that we use in the
evaluation is the CORA data set. The CORA data set
consists of 19,396 scientific articles in the computer
science domain. In order to compose author-pair graph
streams from the scientific publications, we consider
each scientific article as a graph object with co-author
relationships as edges, as in [1][4]. We use the research
topics of the papers as the ground truth to evaluate the
clustering quality. In the CORA data set, all research
papers are classified into a topic hierarchy, with 73
sub-topics on the leaf level. We use the second-level
topics as the labels for evaluation. There are 10 topics
in total: Information Retrieval; Databases; Artificial
Intelligence; Encryption and Compression; Operating
Systems; Networking; Hardware and Architecture; Data
Structures, Algorithms and Theory; Programming; and
Human Computer Interaction. Each paper has 3.3 authors
on average. For the side attributes, we obtain two types
of side information to assist clustering: terms and
citations. The terms are extracted from the paper titles,
and the citations are the list of papers that a given
article cites. On average, a paper cites 4.3 papers and
has 6.1 distinct terms.

• IMDB Data Set: The Internet Movie Database is an
online collection of movies and television shows, which
also contains related information such as actors,
directors and production crew. We obtain a sample of the
IMDB data set that covers ten years of movie data in the
USA, from 1996 to 2005. We further process the data set
to compose an actor-pair graph stream from it. Each movie
is considered as a graph object. The actors of the movie
are nodes, and actor pairs are considered as edges within
the graph. In order to evaluate the effectiveness of the
proposed clustering method, we use the movie genre as the
label. We extract the movies of the top four genres from
IMDB, namely Short, Drama, Comedy and Documentary. In
addition, we remove the movies that have more than one
label. In total, there are 9,793 movies, which consist of
1,718 movies from the Short genre, 3,359 movies from the
Drama genre, 2,324 movies from the Comedy genre and
2,392 movies from the Documentary genre. A movie graph
object has 24.6 edges on average.

Moreover, we extract three side information types
associated with the actor-pair graphs, namely plot
keywords, producers and directors. We extract words by
tokenizing movie plots. After stop word removal, the
frequent words are used as keywords. We notice that the
numbers of distinct keywords, producers and directors in
the whole data set are very large due to data sparsity.
On average, a movie graph object has 16.3 keywords,
1.1 producers and 1.4 directors.
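
The construction of a graph object from a single record, as described for both data sets, can be sketched as follows. The record layout and the names `record_to_graph`, `members` and `side_info` are hypothetical; the sketch only assumes that every pair of people on the same paper or movie forms one undirected edge.

```python
from itertools import combinations

def record_to_graph(members, side_info):
    """Turn one record (a paper or a movie) into a graph object.

    members:   authors of the paper or actors of the movie
    side_info: dict mapping a side information type to its attribute list,
               e.g. {"terms": [...], "citations": [...]}
    """
    nodes = sorted(set(members))             # drop duplicate names
    edges = list(combinations(nodes, 2))     # one edge per unordered pair
    return {"nodes": nodes, "edges": edges, "side": side_info}

# Example: a three-author paper yields three co-author edges.
paper = record_to_graph(
    ["C. Author", "A. Author", "B. Author"],
    {"terms": ["graph", "stream"], "citations": ["paper-17"]},
)
```

A record with n distinct members yields n(n-1)/2 edges, which is why graph sizes are skewed when a few movies have very large casts.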

5.2 Methods In order to demonstrate the effectiveness and 
efficiency of the proposed approach, we compare GSSClu 
with a number of baselines. Since there is no known method 
to cluster graph streams with side information, we use the 
following approaches to show the performance of GSSClu 
from different perspectives: 

(1) GMicro: [1] proposed GMicro to cluster graph
streams by extending the micro-cluster model. GMicro is
the best known method to cluster fast, high-volume graph
streams by considering the similarities of edge structures.
However, this method only considers the linkage within
graphs, and cannot utilize the massive side information
associated with graphs to enhance the clustering process.

(2) GSSClu [w/o opt.]: Since GMicro does not use
side information to cluster, we use a variation of GSSClu
to demonstrate the power of the dynamic distance
optimization framework DMO, shown in Eq. 3.7, for a fair
comparison. Instead of dynamically optimizing the importance
of links and various side attributes, this method assigns
them equal weights as a simplified version of GSSClu. We
refer to this approach as GSSClu [w/o opt.] in all following
figure legends.

(3) Disk-based GSSClu: In order to show the effective- 
ness of the proposed sketch-based framework SGS(C), we 
develop another variation of GSSClu by computing the ex- 
act values of all metrics in the clustering algorithm. Due to 
the massive size of the incoming stream data and its growing 
nature, the data can only be stored on the hard disk to avoid 
the out of memory problem. We refer to this approach as 
Disk-based GSSClu. By comparing it with GSSClu, we can
understand how closely the sketch-based framework SGS(C)
estimates the true values. However, one should note that
Disk-based GSSClu is about 5 to 10 times slower than
GSSClu due to the long response time of disk queries.

5.3 Metrics and Settings The goal of the evaluation is 
to examine if the proposed approach can effectively use 
linkage and side information from the streams to improve 
the clustering results over the baselines. In order to test the 
effectiveness of the proposed scheme, we use the cluster 
purity measure [jij to evaluate the clustering quality. For 
each data set, the labels of graph objects are known but 
excluded from the clustering process. We only use the labels 
to measure the quality of clustering. Specifically, for each
generated cluster, we determine the dominant class label among
the graph objects within the cluster. The purity of each
cluster is computed as the fraction of graph objects in the
cluster that belong to the dominant class label. We report
the average purity score over the clusters as the cluster
purity measure. We note that the cluster purity ranges from
0 to 1, and 1 represents a perfect clustering result. Clearly,
a good clustering will provide a high value of cluster purity.
For efficiency, we report the processing rate for the proposed 
method and baselines. 
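
The purity measure described above can be written down in a few lines. The following sketch assumes that cluster assignments and ground-truth labels are given as two parallel lists; the function name `cluster_purity` is hypothetical.

```python
from collections import Counter

def cluster_purity(assignments, labels):
    """Average, over non-empty clusters, of the dominant-label fraction."""
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, []).append(label)
    purities = [
        Counter(members).most_common(1)[0][1] / len(members)
        for members in by_cluster.values()
    ]
    return sum(purities) / len(purities)

# Cluster 0 holds two "a" and one "b" (purity 2/3); cluster 1 is pure.
score = cluster_purity([0, 0, 0, 1, 1], ["a", "a", "b", "c", "c"])
```

Because the score is an unweighted average over clusters, small pure clusters count as much as large mixed ones, which is one reason purity tends to rise as the number of clusters grows.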

Unless otherwise mentioned, the default value of the
parameter $\gamma$ is 250. We also test the sensitivity to
$\gamma$ at the end of this section. The number of hash
functions is set to 10 and the width of the sketch table is
set to 500 for all baselines and the proposed method. The
default values of k are 10 for the CORA data set and 8 for
the IMDB data set, which are the same settings as in [4][25].

5.4 Effectiveness Results We first show the effectiveness
results for the CORA and IMDB data sets. The effectiveness
results for GSSClu and the baseline algorithms with an
increasing number of processed graphs are shown in
Figures 3(a) and (b). The number of graphs processed is shown
on the X-axis, whereas the cluster purity is illustrated on
the Y-axis.

For the CORA data set, we can observe that all four
approaches achieve stable performance along the progression
of the stream. The reason is that all four approaches are
designed to process stream data. Since the CORA data set
has 10 labels, a random assignment will generate clusters
with purity of roughly 0.1. From the figure, GMicro achieves
a cluster purity of about 0.33 by using links only. The
performance of GSSClu [w/o opt.] is lower than that of
GMicro, although it uses both the links and the side
information. This is because side information is sometimes
quite noisy. Thus, assigning arbitrary (in this case, equal)
weights to links and side attributes may even degrade the
clustering quality. We further note that the proposed
approach GSSClu has a purity score of around 0.45, which is
an improvement of at least 10% over GMicro and GSSClu
[w/o opt.] in terms of purity. This suggests that the
distance optimization DMO can indeed effectively learn the
importance of links and different side attributes. In the
meantime, the performances of GSSClu and Disk-based GSSClu
are quite similar. Disk-based GSSClu is only slightly higher
than GSSClu in terms of purity, and the two can hardly be
distinguished in the figure. This further demonstrates that
the sketch-based approximation maintains the accuracy of the
clustering process.

For the IMDB data set, the four approaches perform
similarly for the first 1,000 received graphs. GMicro gives
an even higher purity score than GSSClu. The reason is that,
with limited data, GSSClu does not have enough statistics to
infer the importance of links and side attributes. As more
graphs are received, we can observe that GSSClu significantly
outperforms both GMicro and GSSClu [w/o opt.], with a purity
improvement of at least 0.25. We note that a random
clustering assignment would get a purity score from 0.25 to
0.3, since there are four roughly balanced labels in the IMDB
data set. The trend for the three baselines is similar to that
on the CORA data set. From the results on both data sets, it
is clear that GSSClu works especially well once a reasonable
number of data points has been received, because the
optimization framework DMO can dynamically adjust the weights
and compute a meaningful unified distance metric. In the
meantime, GSSClu is superior to GMicro and GSSClu [w/o opt.]
as the stream progresses, and the sketch-based estimation is
very close to the exact computation. In other words, the
differences between the values estimated from SGS(C) and the
exact values calculated by Disk-based GSSClu are extremely
small and do not typically lead to measurable differences in
the clustering.

5.5 Efficiency Results We also test the efficiency of
GSSClu and the baselines on the real data sets. Disk-based
GSSClu is 5 to 10 times slower than GSSClu due to the slow
disk access. Thus, we do not show the efficiency results of
Disk-based GSSClu. The results of GSSClu and the other two
baselines are shown in Figures 4(a) and (b). In each figure,
the X-axis shows the progression of the stream in terms of
time, whereas the Y-axis illustrates the stream processing
rate. The processing rate is computed as the number of edges
processed per second. We do not use the number of graphs
processed per second because graphs can have skewed sizes.
Some graphs are very large and need a longer time to process,
so a low processing rate in terms of the number of graphs
would not reasonably represent the underlying efficiency.

From both figures, one can observe that GMicro
achieves the best efficiency. The GSSClu-based approaches
consume more running time because GMicro only processes
linkage data. The side information data has the same order
of magnitude as the linkage data, which increases the
running time of the GSSClu-based approaches. For example, in
the CORA data set, each graph has 3.3 nodes on average,
while it has 4.3 citations and 6.1 terms as side information
on average. In order to process such a large amount of
additional side information, it is natural that the
GSSClu-based approaches consume more running time than the
GMicro approach. Among the GSSClu-based approaches, GSSClu
and GSSClu [w/o opt.] process the same amount of data. It is
evident that both GSSClu and GSSClu [w/o opt.] maintain a
relatively stable processing rate with the progression of
the stream. The figures show that the sketch-based statistics
SGS(C) can indeed process stream data with high efficiency,
because the memory consumption of the statistics remains
constant as the number of received graph objects grows. The
low variability in processing rate is a clear advantage for
use in practice. We further notice that GSSClu [w/o opt.] is
slightly faster than GSSClu. This is quite natural, since
GSSClu needs to periodically optimize and adjust the weights
of links and side attributes. Since the optimization
framework DMO is solved in the sketch representation, the
optimization of GSSClu only adds a slight overhead to the
running time. Considering the tremendous effectiveness
improvement of the proposed approach GSSClu, the overhead in
running time is quite acceptable.

5.6 Sensitivity Analysis Results In order to study how the
parameter $\gamma$ affects the performance of GSSClu, we
further conduct a sensitivity analysis with respect to
$\gamma$. As shown previously, $\gamma$ determines how
frequently the weights in the E-S distance computation are
updated and adjusted. Thus, it is reasonable to optimize the
weights once every few hundred received graphs. In Figure 5,
we present the effectiveness results on both data sets with
three values of $\gamma$, namely 250, 500 and 1000. The number
of graphs processed is shown on the X-axis, and the cluster
purity is illustrated on the Y-axis. In order to present the
detailed differences among the settings of $\gamma$, the
cluster purity is plotted on an enlarged scale. From both
figures, it is evident that GSSClu maintains stable quality
for a wide range of settings of $\gamma$. Meanwhile, a smaller
value of $\gamma$ slightly improves the cluster purity. The
reason is that, under a smaller $\gamma$, the weights and E-S
distances can be adjusted more promptly to adapt to the
incoming data. All of this suggests that the effectiveness of
our proposed method is not sensitive to the setting of
$\gamma$, and that GSSClu is quite robust with the progression
of the stream.

We also test the efficiency of the proposed method over
different settings of $\gamma$. Similar to Figure 4, we plot
the time in seconds on the X-axis, and the processing rate in
edges per second on the Y-axis in Figure 6. We use a finer
granularity on the processing rate than that of Figure 4 to
show the slight differences among the three values of
$\gamma$. Based on the results shown in both figures, the
processing rates for different settings of $\gamma$ are
relatively stable with the increasing number of received
graphs. Furthermore, it is evident that a smaller value of
$\gamma$ leads to a lower processing rate of GSSClu. This is
quite natural, since more frequent optimization consumes more
running time. However, the overhead of the optimization
framework DMO is minimal and does not affect the overall
efficiency much. This suggests that the GSSClu approach is an
efficient and scalable algorithm over a wide range of
settings of $\gamma$.

We further perform a sensitivity analysis with regard
to the number of clusters k. We first show the effectiveness
results with the number of clusters in Figure 7. We present
the number of clusters k on the X-axis, and the cluster purity
score of the whole data set on the Y-axis. It is evident that
the cluster purity score increases when we increase the number
of clusters. The reason is that a larger number of clusters
will generate clusters with finer granularity. Meanwhile, we
can observe that the cluster purity score increases only by
about 0.03 even when the number of clusters is doubled. In
other words, the cluster purity is highly consistent across
all settings of k. This suggests that the proposed approach
GSSClu performs consistently well under a variety of k
settings, and its effectiveness is not sensitive to the
number of clusters.

The efficiency results with the number of clusters k are
illustrated in Figure 8. The number of clusters is shown on
the X-axis, and the processing rate of the whole data set 
is presented on the Y-axis. We test the efficiencies with k 
ranging from 10 to 18 for the CORA data set and from 6 
to 14 for the IMDB data set. From the results on both data 
sets, it is clear that the GSSClu approach scales linearly with 
the number of clusters k in terms of efficiency. Specifically, 
the smaller the number of clusters, the higher the processing 
rate achieved. This is because the distance computation 
in GSSClu scales linearly with the increasing number of 
clusters. 

6 Conclusion 

In this paper, we present the first approach to cluster graph
streams with side information. While many approaches have
been devised to mine graph streams, they focus solely on the
link structures of graphs. Many graph objects in real
applications contain various forms of side information, which
may be used to improve the clustering process. The problem is
challenging, not only because it requires processing
high-volume links and side information efficiently, but also
because it is non-trivial to incorporate side attributes into
the graph clustering process. In order to use both links and
side attributes in the clustering model, we define a unified
distance metric, the E-S Distance, based on edges and side
information. We further propose an optimization framework DMO
to dynamically refine the distance metric by measuring the
inter-cluster and intra-cluster distances. A sketch-based
framework SGS(C) is also introduced to store the statistics
of both edges and side information. We demonstrate that
SGS(C) can not only estimate the measures used in the
clustering algorithm, but also solve the optimization
framework DMO efficiently. The experimental results show that
the proposed method significantly outperforms the baselines
in terms of effectiveness, while it also maintains high
efficiency and scalability. In our future work, we will
consider using side information to improve other graph stream
mining tasks, including classification, outlier detection and
query processing.